Introduction

Here we look at using InferCNV to identify tumor and normal cells in SCPCS000490.

Before running this notebook, InferCNV is run using the run-infercnv.R script. This script runs InferCNV once with Endothelial cells as the reference cells and once without any cells as the reference.

Here, we look the copy number variations found reported by InferCNV and use that copy number information to call cells as tumor or normal cells. We then compare these classifications to manual classifications and annotations from SingleR and CellAssign.

Setup

suppressPackageStartupMessages({
  # load required packages
  library(SingleCellExperiment)
  library(ggplot2)
  library(dplyr)
})
# The base path for the OpenScPCA repository, found by its (hidden) .git directory
repository_base <- rprojroot::find_root(rprojroot::is_git_root)

# The current data directory, found within the repository base directory
data_dir <- file.path(repository_base, "data", "current")
sample_dir <- file.path(data_dir, "SCPCP000015", params$sample_id)

# The path to this module
module_base <- file.path(repository_base, "analyses", "ewings-cell-type")
# source in helper functions
jaccard_functions <- file.path(module_base, "scripts", "utils", "jaccard-functions.R")
source(jaccard_functions)
# Input files
sce_filename <- glue::glue("{params$library_id}_processed.rds")
sce_file <- file.path(sample_dir, sce_filename)

# tumor normal classifications 
classifications_results_dir <- file.path(module_base, "results", "marker_gene_analysis")
classifications_filename <- glue::glue("{params$library_id}_tumor_normal_classifications.tsv")
classifications_file <- file.path(classifications_results_dir, classifications_filename)

# output from running infercnv 
results_dir <- file.path(module_base, "results", "infercnv", params$library_id)

all_results_dir_list <- c(
  no_ref = file.path(results_dir, "no_reference"),
  with_ref = file.path(results_dir, "endothelial_reference")
)
# read in sce object
sce <- readr::read_rds(sce_file)

# read in tumor normal classifications 
manual_classifications_df <- readr::read_tsv(classifications_file)
## Rows: 4290 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (3): barcodes, marker_gene_classification, cluster_classification
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

InferCNV Heatmap

Endothelial cells as reference

Below is the heatmap produced by InferCNV showing the CNV’s for each cell using endothelial cells as a reference. Each row is a cell and each column is a gene. The top group contains all cells that were used as the reference group and the bottom is all other cells. The color indicates the relative expression intensity of each gene with the darker colors either indicating a gain of expression or loss of expression compared to the reference cells.

InferCNV Heatmap - Endothelial cells as reference
InferCNV Heatmap - Endothelial cells as reference

From this heatmap we see that there is a distinct group of cells within the bottom portion of the plot that appear to have a lot more CNVs than the rest of the cells. It’s probable that these correspond to the tumor cells, but InferCNV does not provide any classification of cells as tumor or normal, it just provides information on copy number variations.

No reference

Below is the heatmap produced by InferCNV showing the CNV’s for each cell without using a reference.

From the InferCNV documentation:

Note, if you do not have reference cells, you can set ref_group_names=NULL, in which case the average signal across all cells will be used to define the baseline. This can work well when there are sufficient differences among the cells included (ie. they do not all show a chromosomal deletion at the same place).

Given this information, I wouldn’t expect the CNV inference to work well because we see that all cells that do have prominent CNVs have the same CNVs. There is not a lot of variation.

InferCNV Heatmap - No reference
InferCNV Heatmap - No reference

Looking at this we see a small group of cells that may have some variations, but they are different from the variations seen above. I don’t know that using this without a reference is the best call since it’s unlikely that it will identify distinct groups of cells from tumors with quiet genomes with no baseline. The rest of the analysis will use the results from running InferCNV with endothelial cells as the reference.

Using CNVs to annotate tumor cells

InferCNV uses an HMM-based model to predict whether or not a copy number variation (either loss or gain) is present for each chromosome in each cell. For each predicted CNV region, a posterior probability is calculated for that CNV for each cell.

From the InferCNV documentation:

CNV regions with mean posterior probabilities of being normal (no CNV) that are above a maximum threshold are removed as likely false positive predictions.

The predictions information lies in a few files that live in the output directory. We want to know which cells had CNVs and how many CNVs were found. To map the information about CNVs to individual cells, they recommend using the infercnv::add_to_seurat() function which takes the output directory from running InferCNV and reads in all necessary files and adds the per cell CNV information to the meta.data. So we will start by converting our processed SCE object to a Seurat object, incorporate the CNV information, and then add the CNV information to the SCE object.

infer_cnv_dir <- all_results_dir_list["with_ref"]

# first create a Seurat object from our SCE object
seurat_obj <- scpcaTools::sce_to_seurat(sce)

# add in the CNV information 
# this adds new columns to the meta.data indicating presence of CNVs for each chromosome
seurat_obj <- infercnv::add_to_seurat(seurat_obj = seurat_obj,
                                      infercnv_output_path = infer_cnv_dir)

# grab metadata from Seurat and add back to SCE 
seurat_metadata <- seurat_obj@meta.data |> 
  dplyr::select(-c(orig.ident, nCount_RNA, nFeature_RNA))

colData(sce) <- DataFrame(seurat_metadata, row.names = seurat_metadata$barcodes)

The metadata that we added includes the following columns for each chromosome:

The has columns are boolean columns indicating if that cell has presence of a CNV, loss, or duplication event. If has_cnv is TRUE if either has_loss or has_dupli is TRUE.

See the comments in this issue explaining what the contents of the columns are.

One approach to classifying tumor cells is to look for cells that have a lot of CNVs. Let’s start by just summing up the total CNVs/cell and plotting that.

# make df for plotting
coldata_df <- sce |> 
  scuttle::makePerCellDF(use.dimred = "UMAP") |> 
  # replace UMAP.1 with UMAP1
  dplyr::rename_with(
        \(x) stringr::str_replace(x, "^UMAP\\.", "UMAP")
      )

# remove MT and GL chr, comments in docs recommend that these are not useful and can be inaccurate 
chr_columns_to_keep <- which(!stringr::str_detect(names(coldata_df), "chrMT|chrGL000219.1"))
cnv_df <- coldata_df[, chr_columns_to_keep]
# grab any columns that have has_cnv 
has_cnv_cols <- names(cnv_df)[startsWith(colnames(cnv_df), "has_cnv")]

cnv_df <- cnv_df |>
  # get total number of cnvs per cell and then scale that number 
  dplyr::mutate(has_cnv = rowSums(across(has_cnv_cols)),
                scaled_cnv = scale(has_cnv))
# plot the total number of CNVs
ggplot(cnv_df, aes(x = UMAP1, y= UMAP2, color = has_cnv)) +
  geom_point(alpha = 0.5, size = 0.5) +
  theme_bw() +
  scale_color_viridis_c()

This shows that the total CNVs are highest in the group of cells that correspond to the same cells that are manually annotated as tumor cells, which is promising. If we were to use total CNVs we would need some sort of cut off for the number. Let’s look at the distribution of total CNVs and also the scaled total CNVs to see if we can identify a cut off.

ggplot(cnv_df, aes(x = has_cnv)) +
  geom_density() +
  theme_bw() +
  labs(x = "Total number of CNVs")

ggplot(cnv_df, aes(x= scaled_cnv)) + 
  geom_density() +
  theme_bw() +
  labs(x = "Scaled number of CNVS")

Looking at the scaled number of CNVs it does look like we could use 0 as the cutoff and call any cell that has > 0 CNVs (after scaling) as tumor and all other cells as normal.

# add cnv classification 
cnv_df <- cnv_df |> 
  dplyr::mutate(cnv_classification = dplyr::if_else(scaled_cnv > 0, "Tumor", "Normal"))

ggplot(cnv_df, aes(x = UMAP1, y= UMAP2, color = cnv_classification)) +
  geom_point(alpha = 0.5, size = 0.5) +
  theme_bw() 

Another option is to use the scaled proportion of CNVs which tells us the proportion of genes on that chromosome that are part of the CNV.

proportion_cols <- names(cnv_df)[startsWith(colnames(cnv_df), "proportion_scaled_cnv")]

# get mean proportion of cnvs per cell and then scale that number 
cnv_df <- cnv_df |> 
  dplyr::mutate(mean_proportion = rowMeans(across(proportion_cols)),
                scaled_mean_proportion = scale(mean_proportion))
ggplot(cnv_df, aes(x = UMAP1, y= UMAP2, color = mean_proportion)) +
  geom_point(alpha = 0.5, size = 0.5) +
  theme_bw() +
  scale_color_viridis_c()

ggplot(cnv_df, aes(x = UMAP1, y= UMAP2, color = scaled_mean_proportion)) +
  geom_point(alpha = 0.5, size = 0.5) +
  theme_bw() +
  scale_color_viridis_c()

ggplot(cnv_df, aes(x = scaled_mean_proportion)) +
  geom_density() +
  theme_bw()

It looks like maybe we could again use 0 as the cutoff here and anything above 0 is a tumor cell. Let’s see how this matches up with the tumor/normal classifications we defined just using total CNVs.

ggplot(cnv_df, aes(x = scaled_mean_proportion, color = cnv_classification)) +
  geom_density() +
  theme_bw()

Interestingly, the cells we labeled as tumor have a higher mean proportion value then those that we labeled as normal. I’m not really sure which is “better” to use the mean proportion or just the detection of CNVs to call tumor cells.

For the rest of the notebook, we will use the classifications of tumor cells based on the sum of the total CNV occurrences.

Compare InferCNV annotations to manual annotations

Below we can compare these annotations from InferCNV to classifying tumor cells manually with marker genes.

all_classifications_df <- cnv_df |> 
  dplyr::left_join(manual_classifications_df) |>
  # relevel classifications for confusion matrix 
  dplyr::mutate(
    cnv_classification = forcats::fct_relevel(cnv_classification, "Tumor"),
    marker_gene_classification = forcats::fct_relevel(marker_gene_classification, "Tumor")
  )
## Joining with `by = join_by(barcodes)`
caret::confusionMatrix(
  table(
    all_classifications_df$cnv_classification,
    all_classifications_df$marker_gene_classification
  )
)
## Confusion Matrix and Statistics
## 
##         
##          Tumor Normal
##   Tumor   2293    465
##   Normal    29   1503
##                                           
##                Accuracy : 0.8848          
##                  95% CI : (0.8749, 0.8943)
##     No Information Rate : 0.5413          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7641          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9875          
##             Specificity : 0.7637          
##          Pos Pred Value : 0.8314          
##          Neg Pred Value : 0.9811          
##              Prevalence : 0.5413          
##          Detection Rate : 0.5345          
##    Detection Prevalence : 0.6429          
##       Balanced Accuracy : 0.8756          
##                                           
##        'Positive' Class : Tumor           
## 

This looks pretty consistent with what we see when we annotate cells manually.

We can also compare to the existing annotations obtained using SingleR and CellAssign as part of scpca-nf.

celltype_columns <- c(
  "singler_celltype_annotation",
  "cellassign_celltype_annotation"
)

# create jaccard matrices for SingleR and CellAssign compared to aneuploid/diploid 
jaccard_matrices <- celltype_columns |>
  purrr::map(\(name) {
    make_jaccard_matrix(
      cnv_df,
      "cnv_classification",
      name
    )
  }) |> 
  purrr::set_names("SingleR", "CellAssign")
# Set heatmap padding option
heatmap_padding <- 0.2
ComplexHeatmap::ht_opt(TITLE_PADDING = grid::unit(heatmap_padding, "in"))
# list of heatmaps looking at SingleR/ CellAssign vs tumor/normal 
heatmap <- jaccard_matrices |>
  purrr::imap(
    \(celltype_mat, celltype_method) {
      ComplexHeatmap::Heatmap(
        t(celltype_mat), # transpose because matrix rows are in common & we want a vertical arrangement
        col = circlize::colorRamp2(c(0, 1), colors = c("white", "darkslateblue")),
        border = TRUE,
        ## Row parameters
        cluster_rows = TRUE,
        row_title = celltype_method,
        row_title_gp = grid::gpar(fontsize = 12),
        row_title_side = "left",
        row_names_side = "left",
        row_dend_side = "right",
        row_names_gp = grid::gpar(fontsize = 10),
        ## Column parameters
        cluster_columns = FALSE,
        column_title = "",
        column_title_gp = grid::gpar(fontsize = 12),
        column_names_side = "bottom",
        column_names_gp = grid::gpar(fontsize = 10),
        column_names_rot = 90,
        ## Legend parameters
        heatmap_legend_param = list(
          title = "Jaccard index",
          direction = "vertical",
          legend_width = unit(1.5, "in")
        ),
        show_heatmap_legend = celltype_method == "SingleR",
      )
    }) |>
  # concatenate vertically into HeatmapList object
  purrr::reduce(ComplexHeatmap::`%v%`) |>
  ComplexHeatmap::draw(
    heatmap_legend_side = "right",
    # add a margin to the heatmap so labels don't get cut off
    padding = unit(c(2, 20, 2, 2), "mm")
  )

This looks as we expect with mostly muscle cells lining up with tumor cells.

Check for known CNVs in Ewing’s

There are a few known copy number variations in Ewing’s sarcoma:

Although these are the most frequent, there are patients who do not have any of these alterations and patients that only have some of these alterations. See Tirode et al., and Crompton et al..

Below we will plot whether or not the gain or loss is present in each cell according to InferCNV.

# umaps showing where cells are with gains/losses 
ggplot(cnv_df, aes(x= UMAP1, y = UMAP2, color = has_dupli_chr8)) +
  geom_point(alpha = 0.5, size = 0.5) +
  labs(title = "Chr8 gain") +
  theme_bw()

ggplot(cnv_df, aes(x= UMAP1, y = UMAP2, color = has_dupli_chr12)) +
  geom_point(alpha = 0.5, size = 0.5) +
  labs(title = "Chr12 gain") +
  theme_bw()

ggplot(cnv_df, aes(x= UMAP1, y = UMAP2, color = has_dupli_chr1)) +
  geom_point(alpha = 0.5, size = 0.5) +
  labs(title = "Chr1 gain") +
  theme_bw()

ggplot(cnv_df, aes(x= UMAP1, y = UMAP2, color = has_loss_chr16)) +
  geom_point(alpha = 0.5, size = 0.5) +
  labs(title = "Chr16 loss") +
  theme_bw()

From this we see that the group of tumor cells at the bottom have chr8 and chr1 gains with some having gains of chr12. It doesn’t look like any cells have loss of chr16. The other interesting thing is that there is a group of cells that have gains of chr8, chr1, and chr12 that we did not classify as tumor cells. Should they be…

We can also look at the proportion of the chromosome with the gain.

ggplot(cnv_df, aes(x= UMAP1, y = UMAP2, color = proportion_scaled_dupli_chr8)) +
  geom_point(alpha = 0.5, size = 0.5) +
  labs(title = "Chr8 gain") +
  theme_bw() +
  scale_color_viridis_c()

ggplot(cnv_df, aes(x= UMAP1, y = UMAP2, color = proportion_scaled_dupli_chr12)) +
  geom_point(alpha = 0.5, size = 0.5) +
  labs(title = "Chr12 gain") +
  theme_bw() +
  scale_color_viridis_c()

ggplot(cnv_df, aes(x= UMAP1, y = UMAP2, color = proportion_scaled_dupli_chr1)) +
  geom_point(alpha = 0.5, size = 0.5) +
  labs(title = "Chr1 gain") +
  theme_bw() +
  scale_color_viridis_c()

Now when we look at the proportion, we see that only the group of tumor cells on the bottom seem to have a higher proportion of genes in chr8 that have a gain in expression. Based on the papers, we should see a gain in the whole chr8 and chr12, not just one arm or specific genes, so we expect a high value for the proportion. We see this for chr8 but not in chr12. However, when you look at the heatmap you see that most of Chr8 is included in the gain, but only a small part of Chr12 has a gain, so these plots are not surprising.

I think with using the combination of marker gene expression and CNVs, the bottom group of cells represent the tumor cells.

Conclusions

Session Info

# record the versions of the packages used in this analysis and other environment information
sessionInfo()
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-apple-darwin20
## Running under: macOS Ventura 13.6.6
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Chicago
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] dplyr_1.1.4                 ggplot2_3.5.1               SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
##  [5] Biobase_2.64.0              GenomicRanges_1.56.0        GenomeInfoDb_1.40.0         IRanges_2.38.0             
##  [9] S4Vectors_0.42.0            BiocGenerics_0.50.0         MatrixGenerics_1.16.0       matrixStats_1.3.0          
## 
## loaded via a namespace (and not attached):
##   [1] spatstat.sparse_3.0-3     bitops_1.0-7              lubridate_1.9.3           httr_1.4.7                RColorBrewer_1.1-3       
##   [6] doParallel_1.0.17         tools_4.4.0               sctransform_0.4.1         utf8_1.2.4                R6_2.5.1                 
##  [11] lazyeval_0.2.2            uwot_0.2.2                scpcaTools_0.3.2          GetoptLong_1.0.5          withr_3.0.0              
##  [16] sp_2.1-4                  gridExtra_2.3             parallelDist_0.2.6        progressr_0.14.0          argparse_2.2.3           
##  [21] cli_3.6.2                 formatR_1.14              spatstat.explore_3.2-7    fastDummies_1.7.3         sandwich_3.1-0           
##  [26] sass_0.4.9                labeling_0.4.3            Seurat_5.0.3              mvtnorm_1.2-4             spatstat.data_3.0-4      
##  [31] readr_2.1.5               proxy_0.4-27              ggridges_0.5.6            pbapply_1.7-2             parallelly_1.37.1        
##  [36] limma_3.60.0              rstudioapi_0.16.0         generics_0.1.3            shape_1.4.6.1             vroom_1.6.5              
##  [41] gtools_3.9.5              ica_1.0-3                 spatstat.random_3.2-3     Matrix_1.7-0              futile.logger_1.4.3      
##  [46] fansi_1.0.6               abind_1.4-5               lifecycle_1.0.4           infercnv_1.20.0           multcomp_1.4-25          
##  [51] yaml_2.3.8                edgeR_4.2.0               gplots_3.1.3.1            recipes_1.0.10            SparseArray_1.4.3        
##  [56] Rtsne_0.17                grid_4.4.0                promises_1.3.0            crayon_1.5.2              miniUI_0.1.1.1           
##  [61] lattice_0.22-6            beachmat_2.20.0           cowplot_1.1.3             pillar_1.9.0              knitr_1.46               
##  [66] ComplexHeatmap_2.20.0     rjson_0.2.21              future.apply_1.11.2       codetools_0.2-20          leiden_0.4.3.1           
##  [71] glue_1.7.0                data.table_1.15.4         vctrs_0.6.5               png_0.1-8                 spam_2.10-0              
##  [76] gtable_0.3.5              cachem_1.0.8              gower_1.0.1               xfun_0.43                 S4Arrays_1.4.0           
##  [81] mime_0.12                 prodlim_2023.08.28        libcoin_1.0-10            coda_0.19-4.1             survival_3.5-8           
##  [86] timeDate_4032.109         iterators_1.0.14          hardhat_1.3.1             lava_1.8.0                statmod_1.5.0            
##  [91] fitdistrplus_1.1-11       TH.data_1.1-2             ROCR_1.0-11               ipred_0.9-14              nlme_3.1-164             
##  [96] bit64_4.0.5               RcppAnnoy_0.0.22          rprojroot_2.0.4           bslib_0.7.0               irlba_2.3.5.1            
## [101] KernSmooth_2.23-22        rpart_4.1.23              colorspace_2.1-0          nnet_7.3-19               tidyselect_1.2.1         
## [106] bit_4.0.5                 compiler_4.4.0            DelayedArray_0.30.1       plotly_4.10.4             scales_1.3.0             
## [111] caTools_1.18.2            lmtest_0.9-40             stringr_1.5.1             digest_0.6.35             goftest_1.2-3            
## [116] spatstat.utils_3.0-4      rmarkdown_2.26            XVector_0.44.0            htmltools_0.5.8.1         pkgconfig_2.0.3          
## [121] sparseMatrixStats_1.16.0  highr_0.10                fastmap_1.1.1             rlang_1.1.3               GlobalOptions_0.1.2      
## [126] htmlwidgets_1.6.4         UCSC.utils_1.0.0          shiny_1.8.1.1             DelayedMatrixStats_1.26.0 jquerylib_0.1.4          
## [131] farver_2.1.1              zoo_1.8-12                jsonlite_1.8.8            BiocParallel_1.38.0       ModelMetrics_1.2.2.2     
## [136] magrittr_2.0.3            modeltools_0.2-23         scuttle_1.14.0            GenomeInfoDbData_1.2.12   dotCall64_1.1-1          
## [141] patchwork_1.2.0           munsell_0.5.1             Rcpp_1.0.12               ape_5.8                   reticulate_1.36.1        
## [146] stringi_1.8.4             pROC_1.18.5               zlibbioc_1.50.0           MASS_7.3-60.2             plyr_1.8.9               
## [151] parallel_4.4.0            listenv_0.9.1             ggrepel_0.9.5             forcats_1.0.0             deldir_2.0-4             
## [156] splines_4.4.0             tensor_1.5                hms_1.1.3                 circlize_0.4.16           locfit_1.5-9.9           
## [161] igraph_2.0.3              fastcluster_1.2.6         spatstat.geom_3.2-9       RcppHNSW_0.6.0            reshape2_1.4.4           
## [166] futile.options_1.0.1      evaluate_0.23             SeuratObject_5.0.1        RcppParallel_5.1.7        lambda.r_1.2.4           
## [171] renv_1.0.7                BiocManager_1.30.23       tzdb_0.4.0                phyclust_0.1-34           foreach_1.5.2            
## [176] httpuv_1.6.15             RANN_2.6.1                tidyr_1.3.1               purrr_1.0.2               polyclip_1.10-6          
## [181] clue_0.3-65               future_1.33.2             scattermore_1.2           coin_1.4-3                xtable_1.8-4             
## [186] e1071_1.7-14              RSpectra_0.16-1           later_1.3.2               viridisLite_0.4.2         class_7.3-22             
## [191] rjags_4-15                tibble_3.2.1              cluster_2.1.6             timechange_0.3.0          globals_0.16.3           
## [196] caret_6.0-94